ABSTRACT


Deep Reinforcement Learning (DRL) has been shown to be 
a very powerful technique in recent years on a wide range 
of applications. Much of the prior DRL work took the on- 
line learning approach. However, given the challenges of 
building accurate simulations for modeling student learn- 
ing, we investigated applying DRL to induce a pedagogical 
policy through an offline approach. In this work, we ex- 
plored the effectiveness of offline DRL for pedagogical pol- 
icy induction in an Intelligent Tutoring System. Generally 
speaking, when applying offline DRL, we face two major 
challenges: one is limited training data and the other is the 
credit assignment problem caused by delayed rewards. In 
this work, we used Gaussian Processes to solve the credit 
assignment problem by estimating the inferred immediate 
rewards from the final delayed rewards. We then applied 
the DQN and Double-DQN algorithms to induce adaptive 
pedagogical strategies tailored to individual students. Our 
empirical results show that without solving the credit as- 
signment problem, the DQN policy, although better than 
Double-DQN, was no better than a random policy. How- 
ever, when combining DQN with the inferred rewards, our 
best DQN policy can outperform the random yet reasonable 
policy, especially for students with high pre-test scores. 


1. INTRODUCTION 


Interactive e-Learning Environments such as Intelligent Tu- 
toring Systems (ITSs) and educational games have become 
increasingly prevalent in educational settings. In order to 
design effective interactive learning environments, develop- 
ers must form the basic core of the system and determine 
what is to be taught and how. Pedagogical strategies are 
policies that are used to decide the how part, what action 
to take next in the face of alternatives. Each of these sys- 
tems’ decisions will affect the user’s subsequent actions and 
performance. 


Reinforcement Learning (RL) is one of the best machine 
learning approaches for decision making in interactive
environments, and RL algorithms are designed to induce effective
policies that determine the best action for an agent to take 
in any given situation to maximize some predefined cumu- 
lative reward. In recent years, deep neural networks have 
enabled significant progress in RL research. For example, 
Deep Q-Networks (DQNs) [26] have successfully learned to 
play Atari games at or exceeding human level performance 
by combining deep convolutional neural networks and Q- 
learning. Since then, DRL has achieved notable successes in 
a variety of complex tasks such as robotics control [1] and 
the game of Go [44]. From DQN, various DRL methods such 
as Double DQN [51] or Actor-Critic methods [38, 39] were 
proposed and shown to be more effective than the classic 
DQN. Despite DRL’s great success, there are still many chal- 
lenges preventing DRL from being applied more broadly in 
practice, including applying it to educational systems. One 
major problem is sample inefficiency of current DRL algo- 
rithms. For example, it takes DQN hundreds of millions of 
interactions with the environment to learn a good policy and 
generalize to unseen states, while we seek to learn policies 
from datasets with fewer than 800 student-tutor interaction
logs.


Generally speaking, there are two major categories of RL: 
online and offline. Online RL algorithms learn a policy while
the agent interacts with the environment; offline RL algo- 
rithms, by contrast, learn the policy from pre-collected train- 
ing data. Online RL methods are generally appropriate for 
domains where the state representation is clear and interact- 
ing with simulations and actual environments is relatively 
computationally cheap and feasible, so most prior work
on DRL took an online learning approach. On the
other hand, for domains such as e-learning, building accurate 
simulations or simulating students is especially challenging 
because human learning is a rather complex, not fully under- 
stood process; moreover, learning policies while interacting 
with students may not be feasible and more importantly, 
may not be ethical. Therefore, our DRL approach is offline. 
This approach was achieved by, first, collecting a training 
corpus. One common convention, and the one used in our 
study, is to collect an exploratory corpus by training a group 
of students on an ITS that makes random yet reasonable 
decisions and then apply RL to induce pedagogical policies 
from that exploratory training corpus. An empirical study 
was then conducted from a new group of human subjects 
interacting with different versions of the system. The only 
difference among the versions was the policy employed by 
the ITS. Lastly, the students’ performance was statistically
compared. Due to cost limitations, typically, only the best
RL-induced policy was deployed and compared against some 
baseline policies. 


When applying offline DRL to ITSs, we often face one major 
challenge: our rewards are often not only noisy but also de- 
layed. Given the nature of ITS data collection, the training 
data including our reward functions is often noisy and our 
rewards are only the incomplete or imperfect observations 
of underlying true reward mechanisms. Due to the complex 
nature of student learning, the most appropriate rewards are 
(delayed) student learning gains, which are only available af- 
ter the entire training is complete. For example, hints might 
improve immediate performance but negatively impact over- 
all learning. On the other hand, when the size of the training 
data is limited, the availability of “true” immediate rewards 
is very important for offline RL. Immediate rewards are gen- 
erally more effective than delayed rewards for offline RL be- 
cause it is easier to assign appropriate credit or blame when 
the feedback is tied to a single decision. The more we de- 
lay rewards or punishments, the harder it becomes to assign 
credit or blame properly. Therefore, the challenge is how to 
distribute the delayed rewards to observable, immediate re- 
wards along each student-system interactive trajectory while 
taking the noise and uncertainty in the data into account. 
To tackle this issue, we applied a Gaussian Processes based 
(GP-based) approach to infer “immediate rewards” from the 
delayed rewards and then applied DQN to induce two poli- 
cies: one based on delayed rewards and the other based on 
the inferred immediate rewards, referred to as DQN-Del and 
DQN-Inf respectively. 


In this work, we used a logic ITS and focused on apply- 
ing DRL to induce a policy on one type of tutorial deci- 
sion: whether to present a given problem as a problem solv- 
ing (PS) or a worked example (WE). The tutor presents 
a worked example (WE) by demonstrating the individual 
steps in an expert solution to a problem. During PS, students
are required to complete the problem with tutor support
(e.g. hints). The effectiveness of DQN-Del and DQN-Inf
is evaluated theoretically using Expected Cumulative
Reward (ECR) and empirically through two randomized
controlled experiments: one for evaluating the effectiveness of
DQN-Del in Spring 2018 and the other for evaluating DQN-Inf
in Fall 2018. In each experiment, the effectiveness of the
corresponding RL-induced policy was compared against the 
Random policy that flips a coin to decide between WE/PS 
and the students were randomly assigned into the two con- 
ditions while balancing their incoming competence. Overall, 
the results from both experiments showed no significant dif- 
ference between the DQN-Del and Random in Spring 2018 
and between the DQN-Inf and Random in Fall 2018 on every 
measure of learning performance. 


There are two potential explanations for such findings. First, 
our random baseline policy is decently strong. While ran- 
dom policies are usually bad in many RL tasks, in the con- 
text of WE vs. PS, our random policies can be strong base- 
lines. Indeed, some learning literature suggests that the best 
instructional intervention is to alternate WE and PS [35, 
41, 36]. Second, there may be an aptitude-treatment in- 
teraction (ATI) effect [6, 47], where certain students are 
less sensitive to the induced policies, meaning they achieve a
similar learning performance regardless of policies employed;
whereas other students are more sensitive, meaning their 
learning is highly dependent on the effectiveness of the poli- 
cies. Thus, we divided the students into High vs. Low based 
on their incoming competence and investigated the ATI ef- 
fect. While no ATI effect was found between DQN-Del and 
Random for Spring 2018, a significant ATI effect was found
between DQN-Inf and Random in Fall 2018. 


In short, we explored applying offline DRL for pedagogical 
policy induction based on delayed and inferred immediate 
rewards. Our results showed that no ATI effect was found 
between DQN-Del and Random in Spring 2018, whereas 
there was an ATI effect between DQN-Inf and Random in 
Fall 2018. More specifically, the High incoming competence 
group benefited significantly more from the DQN-Inf policy 
than their peers in the Random condition. This result sug- 
gests that the availability of inferred immediate rewards was 
crucial for effectively applying offline DRL for pedagogical 
policy induction. 


2. BACKGROUND 


A great deal of research has investigated the differing im- 
pacts of worked examples (WE) and problem solving (PS) 
on student learning [49, 22, 21, 23, 41, 27, 36]. McLaren 
and colleagues compared WE-PS pairs with PS-only [22]. 
Every student was given a total of 10 training problems. 
Students in the PS-only condition were required to solve ev- 
ery problem while students in the WE-PS condition were 
given 5 example-problem pairs. Each pair consisted of an 
initial worked example problem followed by tutored prob- 
lem solving. They found no significant difference in learning 
performance between the two conditions. However, the WE- 
PS group spent significantly less time than the PS group. 


McLaren and his colleagues found similar results in two sub- 
sequent studies [21, 23]. In the former, the authors com- 
pared three conditions: WE, PS and WE-PS pairs, in the 
domain of high school chemistry. All students were given 10 
identical problems. As before, the authors found no signifi- 
cant differences among the three groups in terms of learning 
gains but the WE group spent significantly less time than 
the other two conditions; and no significant time on task dif- 
ference was found between the PS and WE-PS conditions. 


In a follow-up study, conducted in the domain of high school 
stoichiometry, McLaren and colleagues compared four con- 
ditions: WE, tutored PS, untutored PS, and Erroneous Ex- 
amples (EE) [23]. Students in the EE condition were given 
incorrect worked examples containing between 1 and 4 errors 
and were tasked with correcting them. The authors found 
no significant differences among the conditions in terms of 
learning gains, and as before the WE students spent signif- 
icantly less time than the other groups. More specifically, 
for time on task, they found that: WE < EE < untutored 
PS < tutored PS. In fact, the WE students spent only 30% 
of the total time that the tutored PS students spent. 


The advantages of WEs were also demonstrated in another 
study in the domain of electrical circuits [50]. The authors 
of that study compared four conditions: WE, WE-PS pairs, 
PS-WE pairs (problem-solving followed by an example problem),
and PS only. They found that the WE and WE-PS
students significantly outperformed the other two groups,
and no significant differences were found among the four
conditions in terms of time on task.


In short, prior research has shown that WE can be as effective as
or more effective than PS or alternating PS with WE, and
the former can take significantly less time than the latter 
two [49, 22, 21, 23, 41]. However, there is no widespread 
consensus on how or when WE vs. PS should be used. This 
is why we will derive pedagogical strategies for them directly 
from empirical data. 


2.1 ATI Effect 


Previous work shows that the ATI effect commonly exists 
in many real-world studies. More formally, the ATI effect 
states that instructional treatments are more or less effective
for individual learners depending on their abilities [6]. For
example, Kalyuga et al. [17] empirically evaluated the effec- 
tiveness of worked example (WE) vs. problem solving (PS) 
on student learning in programmable logic. Their results 
show that WE is more effective for inexperienced students 
while PS is more effective for experienced learners. 


Moreover, D’Mello et al. [7] compared two versions of ITSs: 
one is an affect-sensitive tutor which selects the next prob- 
lem based on students’ affective and cognitive states com- 
bined, while the other is an original tutor which selects the 
next problem based on students’ cognitive states alone. An 
empirical study shows that there is no significant difference 
between the two tutors for students with high prior knowl- 
edge. However, there is a significant difference for students 
with low prior knowledge: those who trained on the affect- 
sensitive tutor had significantly higher learning gain than 
their peers using the original tutor. 


Chi and VanLehn [4] investigated the ATI effect in the do- 
main of probability and physics, and their results showed 
that high competence students can learn regardless of in- 
structional interventions, while for students with low com- 
petence, those who follow the effective instructional inter- 
ventions learned significantly more than those who did not. 
Shen and Chi [43] found that for pedagogical decisions on WE
vs. PS, certain learners are always less sensitive in that their 
learning is not affected, while others are more sensitive to 
variations in different policies. In their study, they divided 
students into Fast and Slow groups based on time, and found 
that the Slow groups are more sensitive to the pedagogical 
decisions while the Fast groups are less sensitive. 


3. RELATED WORK 


Deep Reinforcement Learning: In recent years, many 
DRL algorithms have been developed for various applica- 
tions such as board games like Go [44, 46], Chess and Shogi 
[45], robotic hand dexterity [33, 1], physics simulators [19, 
29, 30], and so forth. While most DRL algorithms have 
been mainly applied online, some of them can also be applied
offline. More specifically, DRL algorithms such as Vanilla
Policy Gradient (VPG) [48], Proximal Policy Optimization 
(PPO) [39], Trust Region Policy Optimization (TRPO) [38], 
or A3C [24] can only be applied for online learning by inter- 
acting with simulations. Some other DRL algorithms can be 
applied for offline learning using pre-collected training data. 
These include the Q-learning based approaches such as Deep
Q-Network (DQN) [26], Double-DQN [51], prioritized experience
replay [37], distributed prioritized experience replay
(Ape-X DQN) [14], and the Actor-Critic based methods such 
as Deep Deterministic Policy Gradient (DDPG) [19], Twin 
Delayed Deep Deterministic policy gradient (TD3) [9], or 
Soft Actor-Critic (SAC) [11]. Among them, DQN and its 
variants have been much more extensively studied; however,
it is still not clear whether they can be successfully applied 
offline for pedagogical policy induction for ITSs. 


Reinforcement Learning in Education: Prior research 
using online RL to induce pedagogical policies has often re- 
lied on simulations or simulated students, and the success of 
RL is often heavily dependent on the accuracy of the simu- 
lations. Beck et al. [3] applied temporal difference learning, 
with off-policy ε-greedy exploration, to induce pedagogical
policies that would minimize student time on task. Igle- 
sias et al. applied another common online approach named 
Q-learning to induce policies for efficient learning [15, 16]. 
More recently, Rafferty et al. applied POMDP with tree 
search to induce policies for faster learning [32]. Wang et 
al. applied an online Deep-RL approach to induce a policy 
for adaptive narrative generation in an educational game [52].
All of the models described above were evaluated by com- 
paring the induced policy with some baseline policies via 
simulations or classroom studies. 


Offline RL approaches, on the other hand, “take advantage 
of previously collected samples, and generally provide ro- 
bust convergence guarantees” [40]. Shen et al. applied value 
iteration and least square policy iteration on a pre-collected 
training corpus to induce pedagogical policies for improv- 
ing students’ learning performance [43, 42]. Chi et al. ap- 
plied policy iteration to induce a pedagogical policy aimed 
at improving students’ learning gains [5]. Mandel et al. 
[20] applied an offline POMDP approach to induce a policy 
which aims to improve student performance in an educa- 
tional game. In classroom studies, most models above were 
found to yield certain improved student learning relative to 
a baseline policy. 


DRL in Education is a subject of growing interest. DRL 
adds deep neural networks to RL frameworks such as POMDP 
for function approximation or state approximation [25, 26]. 
This enhancement makes the agent capable of achieving 
complicated tasks. Wang et al. [52] applied a DRL frame- 
work for personalizing interactive narratives in an educa- 
tional game called CRYSTAL ISLAND. They designed the im- 
mediate rewards based on normalized learning gain (NLG) 
and found that the students with the DRL policy achieved a 
higher NLG score than those following the linear RL model 
in simulation studies. Furthermore, Narasimhan et al. [28] 
implemented a Deep Q-Network (DQN) approach in text- 
based strategy games, constructed based on Evennia, which 
is an open-source library and toolkit for building multi-user
online text-based games. Using simulations, they found that 
the DRL policy significantly outperformed the random pol- 
icy in terms of quest completion. 


In summary, compared with MDP and POMDP, relatively 
little research has been done on successfully applying DRL 
to the field of ITS. None of the prior research has successfully
applied DRL to ITSs without simulated environments,
in order to learn an effective pedagogical strategy that makes
students learn in a more efficient manner. Furthermore, no 
prior work has empirically evaluated any DRL-induced pol- 
icy to confirm its benefits on real students. 


4. METHODS 


In RL, the agent interacts with an environment $\mathcal{E}$, and the
goal of the agent is to learn a policy that will maximize
the sum of future discounted rewards (also known as the
return) along the trajectories, where each trajectory is one
run through the environment, starting in an initial state and
ending in a final state. This is done by learning which action
to take for each possible state. In our case, $\mathcal{E}$ is the learning
context, and the agent must learn to take the actions that
lead to the optimal student learning, by maximizing the return
$R = \sum_{t=0}^{T} \gamma^{t} r_t$, where $r_t$ is the reward at time step $t$,
$T$ is the time step that indicates the end of the trajectory,
and $\gamma \in (0, 1]$ is the discount factor.
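
As a small illustration of the return defined above, the following Python sketch computes the discounted sum of rewards for a single trajectory. The function name `discounted_return` and the reward values are hypothetical, and the sum is assumed to start at t = 0.

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum_t gamma**t * r_t for one trajectory (t starts at 0)."""
    total = 0.0
    # Iterate backwards so that each step computes R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Delayed-reward case: only the final step of the trajectory carries a reward.
delayed = discounted_return([0.0, 0.0, 10.0], gamma=0.5)
print(delayed)  # 2.5
```

With a purely delayed reward, every earlier decision contributes nothing to the sum directly, which is the credit assignment difficulty that motivates inferring immediate rewards.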


4.1 DQN and Double-DQN 

Deep Q-Network (DQN) is, fundamentally, a version of 
Q-learning. In Q-learning, the goal is to learn the optimal
action-value function, $Q^*(s, a)$, which is defined as the expected
reward obtained when taking the optimal action $a$ in
state $s$, and following the optimal policy $\pi^*$ until the end of
the trajectory. For any state-action pair, the optimal action-value
function must follow the Bellman optimality equation
in that:


$$Q^*(s, a) = r + \gamma \max_{a'} Q^*(s', a') \quad (1)$$


Here $r$ is the expected immediate reward for taking action
$a$ at state $s$; $\gamma$ is the discount factor; and $Q^*(s', a')$ is the
optimal action-value function for taking action $a'$ at the subsequent
state $s'$ and following policy $\pi^*$ thereafter.


Compared with the original Q-learning, DQNs use neural networks
(NNs) to approximate action-value functions. This is
because NNs are great universal function approximators and
they are able to handle continuous values in both their inputs
and outputs. In order to train the DQN algorithm,
two neural networks with identical architectures are employed.
One is the main network, and its weights are denoted $\theta$;
the other is the target network, and its weights are denoted
$\theta^-$. The target value used to train the network is
$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$. Thus, the loss function that
is minimized in order to train the main network is:


$$Loss(\theta) = \mathbb{E}\left[(y - Q(s, a; \theta))^2\right] \quad (2)$$


The main network is trained on every training iteration, 
while the target network is frozen for a number of train- 
ing iterations. Every m training iterations, the weights of 
the main neural network are copied into the target network. 
This is one of the techniques used in order to avoid diver- 
gence during the training process. Another one of these 
techniques was the use of an experience replay buffer. This 
buffer contains the p most recent (s,a,r) tuples, and the 
algorithm randomly samples from the buffer when creating 
the batch on each training iteration. We followed the same 
procedure, but as our training was performed offline, the 
experience replay buffer consists of all the samples on our 
training corpus, and it does not get refreshed over time. 
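
The target and loss computation described above can be sketched in NumPy as follows. This is a minimal illustration rather than the implementation used in the study: the function names, the toy Q-values, and the `done` mask that drops the bootstrap term at the end of a trajectory are assumptions. In the offline setting, each batch would be sampled from the fixed training corpus.

```python
import numpy as np


def dqn_targets(rewards, next_q_target, done, gamma=0.9):
    """y = r + gamma * max_a' Q(s', a'; theta^-); y = r on terminal steps."""
    max_next = next_q_target.max(axis=1)
    return rewards + gamma * (1.0 - done) * max_next


def dqn_loss(q_main, actions, targets):
    """Mean squared error between y and Q(s, a; theta) for the logged actions."""
    q_taken = q_main[np.arange(len(actions)), actions]
    return float(np.mean((targets - q_taken) ** 2))


# Toy batch of 3 logged transitions with 2 actions (say WE = 0, PS = 1).
rewards = np.array([0.0, 1.0, 5.0])
done = np.array([0.0, 0.0, 1.0])            # assumed terminal-step mask
next_q_target = np.array([[1.0, 2.0], [0.5, 0.0], [9.0, 9.0]])  # frozen network
q_main = np.array([[1.5, 0.0], [0.0, 1.0], [4.0, 0.0]])         # main network
actions = np.array([0, 1, 0])

y = dqn_targets(rewards, next_q_target, done, gamma=0.5)  # values 1.0, 1.25, 5.0
loss = dqn_loss(q_main, actions, y)                       # 0.4375
print(y, loss)
```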


Double-DQN or DDQN builds on the double Q-learning method
proposed by Van Hasselt [12], which was later combined with
neural networks in the Double-DQN algorithm [51]. The intuition
behind it is to decouple
the action selection from the action evaluation. To achieve
this, the Double-DQN algorithm uses the main neural network
to first select the action that has the highest Q-value
for the next state ($\arg\max_{a'} Q(s', a'; \theta)$) and then evaluates
the Q-value of the selected action using the target network
($Q(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-)$). This simple trick has been
shown to significantly reduce overestimations in Q-value calculations,
resulting in better final policies. With this technique,
the target value used to optimize the main network
becomes:


$$y = r + \gamma Q(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-) \quad (3)$$


The loss function is still the same as in equation 2, but the 
target value y used in the formula is now updated to be the 
one in equation 3. 
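
To make the decoupling of action selection and action evaluation concrete, the sketch below computes the Double-DQN target for a toy batch. The function name, the toy Q-values, and the `done` mask for terminal steps are illustrative assumptions, not details taken from the study.

```python
import numpy as np


def double_dqn_targets(rewards, next_q_main, next_q_target, done, gamma=0.9):
    """y = r + gamma * Q(s', argmax_a' Q(s', a'; theta); theta^-)."""
    # Action selection uses the main network.
    best = next_q_main.argmax(axis=1)
    # Action evaluation uses the target network.
    evaluated = next_q_target[np.arange(len(best)), best]
    return rewards + gamma * (1.0 - done) * evaluated


rewards = np.array([0.0, 1.0])
done = np.array([0.0, 0.0])                 # assumed terminal-step mask
next_q_main = np.array([[1.0, 3.0], [2.0, 0.0]])
next_q_target = np.array([[4.0, 2.0], [1.0, 5.0]])

y = double_dqn_targets(rewards, next_q_main, next_q_target, done, gamma=0.5)
print(y)  # values 1.0 and 1.5
```

On the same batch, the plain DQN target would take the maximum over the target network directly and give 2.0 and 3.5, so the Double-DQN targets are lower whenever the two networks disagree about the best action.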


4.2 Fully Connected vs. LSTM 


For our NN architectures, we explored two options: Fully 
connected NNs and Long Short Term Memory (LSTM). 


Fully Connected or multi-layer perceptrons are the sim- 
plest form of neural network units. They calculate a simple 
weighted sum of all the input units, and each unit produces 
an output value that is often passed to an activation func- 
tion. We used these units to parametrize our neural net- 
works. All the input units are connected to all the units in 
the first hidden layer, and all those units are connected to 
every unit in the next hidden layer. This process continues 
until the final output layer. 


Figure 1: A single LSTM unit containing a forget, input and
output gate


Long Short-Term Memory (LSTM) is a type of re- 
current neural network specifically designed to avoid the 
vanishing and exploding gradient problems [13]. LSTMs 
are particularly suitable for tasks where long-term tempo- 
ral dependencies must be remembered. They achieve this 
by maintaining the previous information of hidden states as 
internal memory. Figure 1 shows the architecture of a single 
LSTM unit. It consists of a memory cell state denoted by
$C_t$ and three gates: the forget gate $f_t \in [0,1]$, the input gate
$i_t \in [0,1]$, and the output gate $o_t \in [0,1]$. These three gates
interact with each other to control the flow of information.
During training, the network learns what to memorize and
when to allow writing to the cell in order to minimize the
training error. More specifically, the forget gate determines
what information from the previous memory cell state is
expired and should be removed; the input gate selects information
from the candidate memory cell state $\tilde{C}_t$ to update
the cell state; and the output gate filters the information
from the memory cell so that the model only considers information
relevant to the prediction task. The value of each
gate is computed as follows, where $W_{\{i,f,c,o\}}$ are the weight
matrices and $b_{\{i,f,c,o\}}$ are the bias vectors:


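$$
\begin{aligned}
i_t &= \mathrm{sigmoid}(W_i \cdot [y_{t-1}, x_t] + b_i) \\
f_t &= \mathrm{sigmoid}(W_f \cdot [y_{t-1}, x_t] + b_f) \\
\tilde{C}_t &= \tanh(W_c \cdot [y_{t-1}, x_t] + b_c) \\
o_t &= \mathrm{sigmoid}(W_o \cdot [y_{t-1}, x_t] + b_o)
\end{aligned}
\quad (4)
$$


The memory cell value $C_t$ and output value $y_t$ from the
LSTM unit are computed using the following formulas:


$$
\begin{aligned}
C_t &= C_{t-1} \cdot f_t + \tilde{C}_t \cdot i_t \\
y_t &= o_t * \tanh(C_t)
\end{aligned}
\quad (5)
$$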
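

The gate and cell computations above can be traced through a single time step with the NumPy sketch below. The function `lstm_step`, the dictionary layout of the weights, and the random toy dimensions are assumptions made for illustration; a deep learning library would normally provide this unit.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, y_prev, c_prev, W, b):
    """One LSTM step; W and b are dicts keyed by 'i', 'f', 'c', 'o'."""
    z = np.concatenate([y_prev, x_t])        # [y_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate memory cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    c_t = c_prev * f_t + c_tilde * i_t       # memory cell update
    y_t = o_t * np.tanh(c_t)                 # unit output
    return y_t, c_t


rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3                        # toy sizes, not the paper's
W = {k: rng.normal(size=(n_hidden, n_hidden + n_in)) for k in "ifco"}
b = {k: np.zeros(n_hidden) for k in "ifco"}
y_t, c_t = lstm_step(rng.normal(size=n_in), np.zeros(n_hidden),
                     np.zeros(n_hidden), W, b)
print(y_t.shape, c_t.shape)
```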


4.3 Inferring Immediate Rewards


A historical dataset consists of $m$ trajectories, $h_1$ to $h_m$,
and $n$ unknown immediate rewards. We would like to infer
the immediate rewards given delayed rewards. In order
to infer the immediate rewards, we used a minimum mean
square error (MMSE) estimator in the Bayesian setting [18,
8, 10]. Assume $R = Dr + \epsilon$ is a linear process where $D$
is a known matrix, $r$ is an $n \times 1$ random vector of unknown
immediate rewards, $R$ is an $m \times 1$ vector of observed delayed
rewards and $\epsilon$ is a vector of independent and identically
distributed noise with mean of zero and standard deviation of
$\sigma_R$. Assuming the discounted sum of the immediate rewards
is equal to the delayed rewards, a linear model matrix $D$ is
proposed as:


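$$
D = \begin{bmatrix}
\overbrace{1 \;\; \gamma \;\; \gamma^2 \;\; \cdots}^{h_1} & \overbrace{0 \;\; 0 \;\; 0 \;\; \cdots}^{h_2} & \cdots \\
0 \;\; 0 \;\; 0 \;\; \cdots & 1 \;\; \gamma \;\; \gamma^2 \;\; \cdots & \cdots \\
\vdots & \vdots & \ddots
\end{bmatrix}
\quad (6)
$$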


where $\gamma$ is the discount factor. Following the linear MMSE
estimator, we assume that the immediate rewards follow a
Gaussian Process defined as $r \sim \mathcal{N}(\mu_r, C_{rr})$, where $\mu_r$ is the
a priori mean and $C_{rr}$ is the a priori covariance defined by
an appropriate kernel [2]. Using the theorem of conditional
distribution of multivariate Gaussian distributions [34], the
conditional expectation of immediate rewards given delayed
rewards, $E[r|R]$, or the posterior mean of immediate rewards,
is:


$$E[r|R] = \mu_r + C_{rr} D^T C_{RR}^{-1} (R - D\mu_r) \quad (7)$$


and the posterior covariance C[r|R] of inferred immediate 
rewards given delayed rewards can be calculated as: 


$$C[r|R] = C_{rr} - C_{rr} D^T C_{RR}^{-1} D C_{rr}^T \quad (8)$$
where $C_{RR} = D C_{rr} D^T + \sigma_R^2 I$ and $I$ is the identity matrix.


Algorithm 1 shows the process used to infer the immediate
rewards. Estimation of the mean and covariance of the random
column vector $r$ in Eqs. 7 and 8 requires the inverse of
the matrix $C_{RR}$. By introducing several intermediary variables,
this algorithm provides an efficient solution to matrix
inversion using the Cholesky decomposition, similar to the
Gaussian Processes algorithm implementation [34].


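Algorithm 1 Immediate reward approximation algorithm.
Inputs: $R$, $\mu_r$, $C_{rr}$, $D$, $\sigma_R^2$
$L = \mathrm{Cholesky}(D C_{rr} D^T + \sigma_R^2 I)$
$\beta = L \backslash (R - D\mu_r)$    (forward-substitution algorithm)
$\alpha = L^T \backslash \beta$    (back-substitution algorithm)
$k = D C_{rr}$
$v = L \backslash k$
$E[r|R] = \mu_r + k^T \alpha$
$C[r|R] = C_{rr} - v^T v$
return: $E[r|R]$ and $C[r|R]$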
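

The steps of Algorithm 1 can be sketched in NumPy as shown below. This is an illustrative sketch under stated assumptions, not the code used in the study: the helper `build_D`, the identity prior covariance standing in for the kernel, the zero prior mean, and the toy delayed rewards are all hypothetical. `np.linalg.solve` is used for the triangular systems for brevity; a dedicated triangular solver would perform the forward and back substitution more efficiently.

```python
import numpy as np


def build_D(lengths, gamma):
    """One row per trajectory; row j holds 1, gamma, gamma^2, ... on its own steps."""
    D = np.zeros((len(lengths), sum(lengths)))
    start = 0
    for j, T in enumerate(lengths):
        D[j, start:start + T] = gamma ** np.arange(T)
        start += T
    return D


def infer_immediate_rewards(R, mu_r, C_rr, D, sigma2_R):
    """Posterior mean and covariance of immediate rewards given delayed rewards."""
    L = np.linalg.cholesky(D @ C_rr @ D.T + sigma2_R * np.eye(len(R)))
    beta = np.linalg.solve(L, R - D @ mu_r)   # forward-substitution step
    alpha = np.linalg.solve(L.T, beta)        # back-substitution step
    k = D @ C_rr
    v = np.linalg.solve(L, k)
    mean = mu_r + k.T @ alpha
    cov = C_rr - v.T @ v
    return mean, cov


# Two toy trajectories with 2 and 3 decisions and one delayed reward each.
D = build_D([2, 3], gamma=0.9)
R = np.array([1.0, -2.0])
mu_r = np.zeros(5)      # assumed zero prior mean
C_rr = np.eye(5)        # placeholder prior covariance, not the paper's kernel
mean, cov = infer_immediate_rewards(R, mu_r, C_rr, D, sigma2_R=1e-8)
print(mean)
print(D @ mean)         # close to R when the noise variance is small
```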


5. POLICY INDUCTION 


In this section, we will describe our ITS, the training corpus, 
our policy induction procedure, and theoretical evaluation 
results. 


5.1 Logic ITS 


The logic tutor used in this study is named Deep Thought 
(DT), and it uses a graph-based environment to solve logic 
proofs. It is used in the undergraduate level Discrete Math- 
ematics class at North Carolina State University. To com- 
plete a problem, students iteratively apply rules to logic 
statement nodes in order to derive the conclusion node. DT 
automatically checks the correctness of each step and pro- 
vides immediate feedback on any rule that is applied incor- 
rectly. The tutor consists of 6 levels, with 3 to 4 problems 
per level. Each problem can be represented as Problem Solv- 
ing (PS) or as Worked Example (WE). Figure 2 (left) shows 
the user interface for PS, and Figure 2 (right) shows the 
interface for WE. 


Figure 2: User Interface for DT. Left: PS. Right: WE. 


5.2 Training Corpus

Our training corpus contains 786 complete student trajecto- 
ries collected over five semesters. On average, each student 
spent two hours to complete the tutor. For each student, the 
tutor makes about 19 decisions. From our student-system 
interaction logs, we extracted a total of 142 state features: 


• Autonomy: 10 features describing the amount of work
done by the student.


• Temporal: 29 features, including average time per
step, the total time spent on the current level, the
time spent on PS, the time spent on WE, and so on.


• Problem Solving: 35 features such as the difficulty of
the current problem, the number of easy and difficult
problems solved on the current level, the number of
PS and WE problems seen in the current level, or the
number of nodes the student added in order to reach
the final solution.


• Performance: 57 features such as the number of
incorrect steps, and the ratio of correct to incorrect rule
applications for different types of rules.


• Hints: 11 features such as the total number of hints
requested or the number of hints the tutor provided
without the student asking for them.


The features contain non-negative continuous values. As 
their range varies significantly (time can be a large num- 
ber while problem difficulty is always between 1 and 9), we 
normalized each feature to the range [0, 1]. Input feature 
normalization has been shown to improve the stability of 
the learning process on neural networks, and often leads to 
faster convergence. 
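
As an illustration of scaling features to [0, 1], the sketch below applies per-feature min-max normalization. The text does not specify the exact normalization procedure, so min-max scaling, the function name, and the toy values are assumptions.

```python
import numpy as np


def min_max_normalize(X, eps=1e-12):
    """Scale each column of X to [0, 1]; constant columns map to 0."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    # Avoid division by zero for features that never vary.
    span = np.where(hi - lo > eps, hi - lo, 1.0)
    return (X - lo) / span


# Toy rows: (time on level in seconds, problem difficulty between 1 and 9).
X = np.array([[1200.0, 1.0],
              [300.0, 9.0],
              [750.0, 5.0]])
X_norm = min_max_normalize(X)
print(X_norm)
```

In practice the minimum and maximum would be computed on the training split only and reused for held-out data.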


To induce our pedagogical policy, while previous research 
mainly used learning gains or time on task as reward func- 
tion, our reward function here is based on the improvement 
of learning efficiency, which balances both learning gain improvement
and time on task improvement. In this way, if
two students have the same amount of learning gain, the 
one who takes shorter time would get higher reward. To 
calculate their learning efficiency, we used students’ scores 
obtained on each level divided by the training time on the 
level. Students must solve the last problem on each level 
without help, and we use this as a level score. The range 
of the score for each level is $[-100, +100]$, and the learning
gain for level $L$ is calculated as $Score_L - Score_{L-1}$, thus
having a range of $[-200, +200]$.


5.3 Training Process

For both DQN and Double DQN, we explored using Fully
Connected (FC) NNs or using LSTM to estimate the action-value
function Q. Our FC has four fully connected layers of
128 units each and uses Rectified Linear Unit (ReLU) as the
activation function. Our LSTM architecture consists of two
layers of 100 LSTM units each, with a fully connected layer
at the end. Additionally, for either FC or LSTM, for a given
time $t$, we explored three input settings: to use only the
current state observation $s_t$ ($k = 1$), to use the last two
state observations, $s_{t-1}$ and $s_t$ ($k = 2$), and to use the last
three, $s_{t-2}$, $s_{t-1}$ and $s_t$ ($k = 3$).
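
The three input settings can be illustrated with a small NumPy sketch that gathers the last k observations of a trajectory. The function name is hypothetical, and repeating the first observation when fewer than k states are available is an assumption, since the text does not say how the start of a trajectory is handled.

```python
import numpy as np


def stack_last_k(states, t, k):
    """Return s_{t-k+1}, ..., s_t as a (k, n_features) array."""
    # Indices before the start of the trajectory are clipped to 0,
    # which repeats the first observation (an assumed padding choice).
    idx = np.clip(np.arange(t - k + 1, t + 1), 0, None)
    return states[idx]


# Toy trajectory: 4 decision points with 3 features each.
states = np.arange(12).reshape(4, 3)

window = stack_last_k(states, t=2, k=2)   # rows for s_1 and s_2
fc_input = window.reshape(-1)             # FC: one flat vector of k * n_features
lstm_input = window                       # LSTM: a sequence of k observations
print(fc_input.shape, lstm_input.shape)   # (6,) (2, 3)
```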


In the case of the fully connected (FC) model, the
observations are concatenated and passed to the input layer as
a flat array of values. For LSTM, the input state observations
are passed to the network sequentially. These past
observations provide extra information about the student’s
performance in the previous states. However, including
previous states also adds complexity to the network, which can
slow down learning and increase the risk of converging to a
weaker final policy. As the number of parameters in the NNs
increases, so does the chance that a NN gets stuck at a local
optimum, especially when training data is limited. L2 regularization


[Figure 3 plot: PDIS estimates (y-axis) for the input settings k = 1, 2, 3 with the FC and LSTM architectures, under DQN and DDQN.]


Figure 3: Importance sampling results. 


was used to get a model that generalizes better. We trained 
our models for 50,000 iterations, using a batch size of 200. 
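
The three input settings described in this section can be sketched in Python as follows. This is only an illustration: the helper names are hypothetical, and left-padding with zeros when fewer than k observations exist is an assumption, since the handling of early time steps is not specified here.

```python
from typing import List, Sequence

Observation = Sequence[float]


def last_k_observations(trajectory: List[Observation], t: int, k: int) -> List[List[float]]:
    """Return [s_{t-k+1}, ..., s_t], oldest first.

    Steps before the start of the trajectory are filled with zero vectors
    (an assumption made for this sketch).
    """
    if k < 1 or not 0 <= t < len(trajectory):
        raise ValueError("need k >= 1 and a valid time index t")
    dim = len(trajectory[t])
    window = []
    for j in range(t - k + 1, t + 1):
        window.append(list(trajectory[j]) if j >= 0 else [0.0] * dim)
    return window


def fc_input(trajectory: List[Observation], t: int, k: int) -> List[float]:
    """FC model input: the k observations concatenated into one flat array."""
    return [x for obs in last_k_observations(trajectory, t, k) for x in obs]


def lstm_input(trajectory: List[Observation], t: int, k: int) -> List[List[float]]:
    """LSTM model input: the k observations kept as a sequence of time steps."""
    return last_k_observations(trajectory, t, k)


# Hypothetical trajectory with three 2-dimensional state observations.
traj = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
flat_k3 = fc_input(traj, t=2, k=3)
seq_k2 = lstm_input(traj, t=2, k=2)
```

The FC input grows linearly with k, while the LSTM input keeps a fixed per-step width and a sequence length of k, which is one way to see why larger k adds parameters mainly to the FC model's first layer.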


5.4 Induced Policy 

First, we induced the DQN-Del policy using delayed rewards
only. We split our data by student: 90% of the students for
training and 10% for testing. We trained all 12 of our models
(DQN and Double-DQN with either FC layers or LSTM layers,
and with $k \in \{1, 2, 3\}$) on the training data and
evaluated their performance on the testing data. We repeated
this process twice with two different test sets and report
the average performance on a series of popular off-policy
evaluation metrics. Among them, Expected Cumulative Reward
(ECR) is the most widely used. However, Per-Decision
Importance Sampling (PDIS) has been shown to be more
robust [31].


ECR is simply calculated by averaging over the highest Q- 
value for all the initial states in the validation set. The 
formula is described in Equation 9. 


$$ECR = \frac{1}{N} \sum_{i=1}^{N} \max_{a} Q(s_i, a) \qquad (9)$$


Here, $s_i$ is an initial state, and $N$ denotes the number of
trajectories in the validation set.
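
Equation 9 can be written as a short Python sketch. The function name, the toy lookup table standing in for the learned Q-network, and the state labels are assumptions made for illustration only.

```python
from typing import Callable, Hashable, Sequence


def expected_cumulative_reward(
    initial_states: Sequence[Hashable],
    q_function: Callable[[Hashable, str], float],
    actions: Sequence[str],
) -> float:
    """Average, over all initial states, of the highest Q-value (Equation 9)."""
    if not initial_states:
        raise ValueError("need at least one initial state")
    total = sum(max(q_function(s, a) for a in actions) for s in initial_states)
    return total / len(initial_states)


# Toy Q-table (hypothetical values) for two initial states and the two actions.
toy_q = {
    ("s1", "WE"): 1.0, ("s1", "PS"): 3.0,
    ("s2", "WE"): 2.0, ("s2", "PS"): 0.0,
}
ecr = expected_cumulative_reward(
    ["s1", "s2"], lambda s, a: toy_q[(s, a)], ["WE", "PS"]
)
```

In this toy case the maxima are 3.0 and 2.0, so `ecr` is their average, 2.5.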


PDIS [31] is an alternative to regular Importance Sampling
that reduces the variance of the estimates. The PDIS results
of the 12 models are shown in Figure 3. The PDIS result
of the random policy is used to set y = 0 (the red line)
in Figure 3. Much to our surprise, while Double-DQN has
been shown to be much more robust in online DRL
applications, its performance is generally worse than that of
DQN here, especially when k = 1 and k = 2. Figure 3 shows
that the best policy is induced using DQN with the LSTM
architecture for k = 3, which is thus selected as DQN-Del. We
also compared the selected policy with the remaining ones
using ECR and other evaluation metrics, and the results showed
that DQN with the LSTM architecture for k = 3 is always among
the best policies across the different evaluation metrics.
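
For readers unfamiliar with PDIS, the following Python sketch shows the standard per-decision estimator: each reward is weighted by the product of the likelihood ratios up to and including its own time step. The uniform behavior policy, the stochastic target policy, and all names and values are assumptions for illustration; they are not taken from the evaluation reported above.

```python
from typing import Callable, Hashable, List, Tuple

Step = Tuple[Hashable, str, float]  # (state, action, reward)


def pdis(
    trajectories: List[List[Step]],
    target_prob: Callable[[Hashable, str], float],
    behavior_prob: Callable[[Hashable, str], float],
    gamma: float = 1.0,
) -> float:
    """Per-Decision Importance Sampling estimate of the target policy's return."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0    # running product of pi(a|s) / b(a|s)
        discount = 1.0  # gamma ** t
        for state, action, reward in trajectory:
            weight *= target_prob(state, action) / behavior_prob(state, action)
            total += discount * weight * reward
            discount *= gamma
    return total / len(trajectories)


# Assumed policies: uniform behavior policy, target policy that prefers PS.
behavior = lambda s, a: 0.5
target = lambda s, a: 0.8 if a == "PS" else 0.2

logged = [[("s0", "PS", 1.0), ("s1", "WE", 2.0)]]
estimate = pdis(logged, target, behavior)
```

In the toy trajectory, the first reward is weighted by 0.8/0.5 and the second by (0.8/0.5)(0.2/0.5), so a late reward is discounted only by the decisions that preceded it rather than by the whole trajectory's ratio.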


To evaluate the impact of Inferred rewards on the DQN in- 


173 Proceedings of The 12th International Conference on Educational Data Mining (EDM 2019) 


[Figure 4 plot: ECR versus training iterations (0 to 40,000) for DQN-Inf (solid line) and DQN-Del (dashed line).]


Figure 4: ECR evolution of DQN-Del and DQN-Inf. 


duced policies, we used the same approach to induce the 
DQN-Inf policy; the only major difference is that we used
the inferred immediate rewards, calculated through
Algorithm 1, in the training dataset. During the training
process, we calculated the ECRs of DQN with the LSTM
architecture for k = 3 using the original delayed rewards
(DQN-Del) vs. using the inferred immediate rewards (DQN-Inf).
The evolution of the ECR values for each policy during the
training process is shown in Figure 4, which suggests that
with the inferred rewards training can, in theory, converge
faster and to a better policy.


6. EMPIRICAL EXPERIMENT SETUP 


Two empirical experiments were conducted, one in the Spring 
2018 semester and one in the Fall 2018 semester. They were 
both conducted in the undergraduate Discrete Mathematics 
class at North Carolina State University. 


6.1 Experiment 1: Spring 2018 

84 students from the Spring 2018 class were randomly
assigned to the Random (control) group and the DQN-Del
group. Because both WE and PS are considered to be
reasonable educational interventions in the context of
learning, we refer to our control random policy as a random
yet reasonable policy, or Random, in the following. The
assignment was done in a balanced random manner, using the
pre-test score to ensure that the two groups had similar prior
knowledge. N = 45 students were assigned to Random and
N = 39 to DQN-Del. Among them, N = 41 Random students and
N = 33 DQN-Del students completed the training. A $\chi^2$
test showed no significant difference between the completion
rates of the two groups: $\chi^2(1, N = 84) = 0.053$,
$p = 0.817$.


6.2 Experiment 2: Fall 2018 

98 students from the Fall 2018 Discrete Mathematics class
were distributed into two conditions: the Random (control)
group and the DQN-Inf group. The group sizes were N = 49 for
Random and N = 49 for DQN-Inf. A total of 84 students
completed the experiment: N = 43 for Random and N = 41 for
DQN-Inf. A $\chi^2$ test of independence showed no
significant difference between the completion rates of the
two groups: $\chi^2(1, N = 98) = 0.025$, $p = 0.872$.


6.3 Performance Measure 

Our tutor consists of 6 strictly ordered levels of proof
problems. All of the students received the same set of
problems in level 1. Their initial proficiency is calculated
based upon the number of mistakes made on the final problem of
level 1 and the total training time on level 1. The
proficiency reflects how well they understand the material and
can apply the logic rules in the proof process before the
tutor follows different pedagogical policies. In each
subsequent level, DT follows the corresponding policy to
determine whether the next problem is a WE or a PS. The last
problem on each level is used as a mini-posttest to measure
students’ performance on that level.


When inducing both the DQN-Del and DQN-Inf policies, we
calculated our reward function based upon the improvement of
students’ learning efficiency, which is defined as the level
score divided by the training time on that level. So, to
measure student performance, we first calculate the learning
efficiency on each level as the score obtained by the student
on the last problem of that level divided by the total time
(in minutes). In this study, we use students’ learning
efficiency in level 1 as their pre-test efficiency score and
their learning efficiency in levels 2 to 6 as the post-test
efficiency scores. Since our DQNs used learning efficiency
improvement as their rewards, we expect that the DRL-induced
policy would cause students to have higher post-test
efficiencies.
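
The performance measure above can be sketched in Python as follows. The function name and the example log are hypothetical, and aggregating levels 2 to 6 by their mean is an assumption of this sketch, since the aggregation is not stated here.

```python
from typing import List, Tuple


def pre_and_post_efficiency(levels: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Split per-level efficiencies into a pre-test and a post-test score.

    `levels` holds six (level score, total minutes) pairs, level 1 first.
    Level 1 gives the pre-test efficiency; levels 2 to 6 are averaged
    (an assumption) to give one post-test efficiency.
    """
    if len(levels) != 6:
        raise ValueError("expected exactly six levels")
    efficiencies = [score / minutes for score, minutes in levels]
    pre = efficiencies[0]
    post = sum(efficiencies[1:]) / len(efficiencies[1:])
    return pre, post


# Hypothetical log for one student: (score on last problem, minutes on level).
student_log = [(50, 25), (60, 20), (80, 20), (40, 10), (100, 20), (30, 10)]
pre, post = pre_and_post_efficiency(student_log)
```

For this made-up student the per-level efficiencies are 2, 3, 4, 4, 5 and 3, so the pre-test efficiency is 2 and the averaged post-test efficiency is 3.8.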


7. RESULTS 
7.1 Experiment 1 Results 


No significant difference was found on the pre-test efficiency
between the Random and DQN-Del groups: t(72) = 1.086, p =
0.281. We divided the students into high pre-test efficiency
(n = 37) and low pre-test efficiency (n = 37) groups, based 
upon their learning efficiency on the pre-test. As expected, 
there was a significant difference between the high and low 
efficiency students on their pre-test efficiency: t(72) = 9.570, 
p < 0.001. The partition mentioned above resulted in four 
groups, based upon their incoming efficiency and condition: 
DQN-Del-High (n = 16), DQN-Del-Low (n = 17), Random- 
High (n = 21), and Random-Low (n = 20). A t-test showed 
no significant difference in the pre-test efficiencies either be- 
tween the two low groups, Random-Low and DQN-Del-Low, 
or between the two high efficiency groups. These results 
show that there is no significant difference in the pre-test 
efficiency across conditions. 


A two-way ANCOVA test on the post-test efficiency, using
Condition {Random, DQN-Del} and Incoming Competency
{Low, High} as factors and pre-test efficiency as a covariate,
showed no significant main effect of Condition,
F(1,69) = 2.633, p = 0.109, and no significant main effect
of Incoming Competency, F(1,69) = 0.036, p = 0.849. No
interaction (ATI) effect was found either, F(1,69) = 1.285,
p = 0.261. Thus, we conclude that there was no significant
difference between the two conditions in the Spring 2018
study.

Figure 5: Post-Test Learning Efficiency across different
groups for the Fall 2018 study.


7.2 Experiment 2 Results 

In Fall 2018, again no significant difference was found on
the pre-test efficiency between the Random and DQN-Inf
groups: t(82) = -0.333, p = 0.739. The students were also
divided into high pre-test efficiency (n = 42) and low
pre-test efficiency (n = 42) groups. A t-test showed a
significant difference between the high and low efficiency
students on the pre-test efficiency: t(82) = 6.38, p < 0.001.
The same four groups were formed, based upon their incoming
efficiency and condition: DQN-Inf-High (n = 20),
DQN-Inf-Low (n = 21), Random-High (n = 22), and Random-Low
(n = 21). A t-test showed no significant difference on the
pre-test efficiencies when comparing the Random-Low and
DQN-Inf-Low groups: t(40) = 0.027, p = 0.978. No significant
difference was found either when performing a t-test on
the two high efficiency groups: t(40) = -0.698, p = 0.489.
This shows that there is no significant difference on the
pre-test efficiency across conditions during the Fall 2018
study.


A two-way ANOVA test using Condition {Random, DQN-Inf} and
Incoming Competency {Low, High} as two factors showed a
significant interaction effect on students’ post-test
efficiency: F(1,80) = 5.038, p = 0.027 (as shown in
Figure 5). To be more rigorous, we ran a two-way ANCOVA test
using Condition and Incoming Competency as two factors
and pre-test efficiency as a covariate. This analysis also
showed a significant interaction effect on students’ post-test
efficiency: F(1,79) = 4.687, p = 0.033. Thus, the interaction
effect remains significant after taking the pre-test
efficiency into consideration. No significant main effect was
found for either Condition or Incoming Competency. A one-way
ANCOVA test on the post-test efficiency for the Low
competency groups, using Condition {Random-Low, DQN-Inf-Low}
as a factor and pre-test competency as a covariate,
showed no significant difference on the post-test efficiency:
F(1,39) = 0.429, p = 0.516. However, a significant
difference was found for the High groups, F(1,39) = 5.513,
p = 0.024, with means -0.719 for Random-High and 2.916
for DQN-Inf-High (as shown in Figure 5).


7.3 Log Analysis 

This section will show more details on the different types of 
tutorial decisions made across the different conditions and 
studies. The features that were analyzed include the total 
number of problems each student encountered (TotalCount), 
the number of problems solved (PSCount), the number of 
difficult problems solved (diffPSCount), the number of WEs 
seen (WECount), and the number of difficult WEs seen (dif- 
fWECount). Table 1 shows the summary of these five fea- 
tures for each condition and study. Columns 3 and 4 show 
the mean and standard deviation of each condition for these 
categories. Column 5 shows the statistical results of different 
t-tests comparing the two conditions. 


No significant difference is found for the total number of 
problems seen by each group. However, we observed that 
for the features diffPSCount, WECount and diffWECount, 
a significant difference was found only during the Spring 
2018 study. Looking at the mean values, we notice that the 
DQN-Del policy assigned fewer WE and more PS problems. 
However, this did not improve the performance of the stu- 
dents in the DQN-Del group during this study. During the 
Fall 2018 study, we only observe a significant difference in 
the number of PS problems assigned. No significant differ- 
ence was found in the remaining categories. 


Table 2 shows the values of those same features, but only
for the High competency students in each study. During
the Spring 2018 semester, we find a statistically significant 
difference for TotalCount, PSCount, and diffWECount, and 
we find a marginal difference for WECount. This shows that 
the DQN-Del policy gave more PS problems, fewer WE, and 
fewer difficult WE problems, but no significant difference 
was found in students’ post-test performance. The Fall 2018 
study results show no significant or marginal difference in 
any of the five categories. Despite this fact, the DQN-Inf 
policy implemented in the Fall 2018 study outperformed the 
Random policy for the High competency students. We can 
also observe how, in Table 2, the standard deviation for the 
DQN groups is often larger than the standard deviation for 
the Random groups. This makes sense because we expect all
the students in the Random group to have similar values
in each category. The DQN policy, however, appears to
assign more PS to certain students and more WE to
other students, resulting in a larger standard deviation.


In short, our log analysis results suggest that what matters
is not the total number of PSs and WEs that students receive,
but rather how or when they receive each.


8. CONCLUSIONS 


We used offline Deep Reinforcement Learning algorithms in
conjunction with inferred immediate rewards to induce a
pedagogical policy that improves students’ learning
efficiency for a logic tutor. Our results showed that our
DRL-induced pedagogical policy can outperform the Random
policy, which is a strong baseline here. More specifically,
there was an ATI effect in the Fall 2018 study: the high
incoming competency students benefited more from our
DRL-induced policy, achieving better post-test learning
efficiency than the other groups. Our results also showed that
our proposed Gaussian Processes based approach to infer
“immediate rewards” from the delayed rewards seems reasonable
and works well here. Thus, offline DRL can be successfully
applied to real-life environments even with a limited
training dataset with delayed rewards.

Table 1: Log analysis results per semester and condition.

Table 2: Log analysis results for the high competency groups per semester.